Spam Detection

NLP
code
NLP
Natural Language, Processing
Python, R
Author

Simone Brazzi

Published

September 5, 2024

1 Introduction

Tasks:

  • Train a model to classify spam.
  • Find the main topic in the spam emails.
  • Calculate semantic distance of the spam topics.
  • Extract the ORG entities from ham emails.

2 Import

2.1 R libraries

Code
library(tidyverse, verbose = FALSE)
library(ggplot2)
library(gt)
library(reticulate)
library(plotly)

# renv::use_python("/Users/simonebrazzi/venv/blog/bin/python3")

2.2 Python packages

Code
import pandas as pd
import numpy as np
import spacy
import nltk
import string
import gensim
import gensim.corpora as corpora
import gensim.downloader

from nltk.corpus import stopwords
from gensim.models import Word2Vec
from scipy.spatial.distance import cosine
from sklearn.metrics.pairwise import cosine_similarity

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

from collections import Counter
# glove vector
glove_vector = gensim.downloader.load("glove-wiki-gigaword-300")
# nltk stopwords
nltk.download('stopwords')
True
Code
stop_words = stopwords.words('english')

2.3 Config class

Code
class Config():
  def __init__(self):
    
    """
    Class initialization: dataset path and random seed.
    """
    
    self.path="/Users/simonebrazzi/R/blog/posts/spam_detection/spam_dataset.csv"
    self.random_state=42
  
  def get_lemmas(self, doc):
    
    """
    Return the lemmas of each document, keeping only tokens that:
      1. Are not punctuation (compared in lowercase).
      2. Are not stop words.
      3. Are not whitespace.
      4. Are not the word 'subject'.
      5. Are not digits.
      6. Have length >= 5.
    """
    
    lemmas = [
      [
        t.lemma_ for t in d 
        if (text := t.text.lower()) not in punctuation 
        and text not in stop_words 
        and not t.is_space 
        and text != "subject" 
        and not t.is_digit
        and len(text) >=5
        ]
        for d in doc
        ]
    return lemmas
  
  def get_entities(self, doc):
    
    """
    List comprehension to get the lemmas which have entity type 'ORG'.
    """
    
    entities = [
    [
      t.lemma_ for t in d 
      if (text := t.text.lower()) not in punctuation 
      and text not in stop_words 
      and not t.is_space 
      and text != "subject" 
      and not t.is_digit
      and len(text) >=5
      and t.ent_type_ == "ORG"
      ]
      for d in doc
      ]
    return entities
  
  def get_sklearn_mlp(self, activation, solver, max_iter, hidden_layer_sizes, tol):
    
    """
    Initialize the sklearn MLPClassifier.
    """
    
    mlp = MLPClassifier(
      activation=activation,
      solver=solver,
      max_iter=max_iter,
      hidden_layer_sizes=hidden_layer_sizes,
      tol=tol,
      verbose=True
      )
    return mlp
  
  def get_lda(self, corpus, id2word, num_topics, passes, workers, random_state):
    
    """
    Initialize LDA.
    """
    
    lda = gensim.models.LdaMulticore(
      corpus=corpus,
      id2word=id2word,
      num_topics=num_topics,
      passes=passes,
      workers=workers,
      random_state=random_state
      )
    return lda
  
config = Config()

3 Dataset

Code
df = pd.read_csv(config.path, index_col=0)

To continue the analysis in R as well, we assign df to an R variable through the reticulate library.

Code
library(reticulate)
df <- py$df

3.1 EDA

First things first: Exploratory Data Analysis. Since we are going to perform a classification, it is interesting to check whether the dataset is unbalanced.

Code
df_g <- df %>% 
  summarise(
    freq_abs = n(),
    freq_rel = n() / nrow(df),
    .by = label
    )

df_g %>% 
  gt() %>% 
  fmt_auto() %>% 
  cols_width(
    label ~ pct(20),
    freq_abs ~ pct(35),
    freq_rel ~ pct(35)
    ) %>% 
  cols_align(
    align = "center",
    columns = c(freq_abs, freq_rel)
  ) %>% 
  tab_header(
    title = "Label frequency",
    subtitle = "Absolute and relative frequencies"
  ) %>% 
  cols_label(
    label = "Label",
    freq_abs = "Absolute frequency",
    freq_rel = "Relative frequency"
  ) %>% 
  tab_options(
    table.width = pct(100)
    )
Table 1: Label frequency (absolute and relative frequencies)

Label   Absolute frequency   Relative frequency
ham     3,672                0.71
spam    1,499                0.29

A visual cue is also useful.

Code
g <- ggplot(df_g) +
  geom_col(aes(x = freq_abs, y = label, fill = label)) +
  scale_fill_brewer(palette = "Set1", direction = -1) +
  theme_minimal() +
  ggtitle("Absolute frequency barchart") +
  xlab("Absolute frequency") +
  ylab("Label") +
  labs(fill = "Label")
ggplotly(g)
Figure 1: Absolute frequency barchart

The dataset is not balanced, which could be relevant when training the classification model. If the model performs poorly, class imbalance is the first thing to investigate.
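The same check can be done on the Python side. This is a minimal sketch, assuming the labels are available as a plain list; the `labels` list below is rebuilt from the counts in Table 1 instead of being read from the dataset.

```python
from collections import Counter

# Stand-in for df.label: rebuilt from the counts reported in Table 1.
labels = ["ham"] * 3672 + ["spam"] * 1499

counts = Counter(labels)
total = sum(counts.values())

# Relative frequency of each class, rounded as in the table.
rel_freq = {label: round(n / total, 2) for label, n in counts.items()}
print(rel_freq)

# Imbalance ratio: how many ham emails there are for each spam email.
imbalance_ratio = counts["ham"] / counts["spam"]
print(round(imbalance_ratio, 2))
```

A ratio well above 1 is a quick signal that accuracy alone may be misleading, which is why per-class metrics are used later.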

4 Preprocessing

With text, preprocessing is fundamental: the computer does not understand the semantic or grammatical meaning of words. Text preprocessing usually follows these steps:

  1. Lowercasing.
  2. Punctuation removal.
  3. Lemmatization.
  4. Tokenization.
  5. Stopwords removal.

spaCy can apply these steps. Calling nlp.pipe processes the texts in batches, which improves performance, and returns a generator that yields Doc objects rather than a list; to reuse the results, the call has to be wrapped in list(). Processing can be sped up further with the n_process argument of nlp.pipe, which enables multiprocessing. But what does the variable nlp stand for? It holds a loaded spaCy model: we are going to use en_core_web_lg.

  • Language: EN; English.
  • Type: CORE; vocabulary, syntax, entities, vectors.
  • Genre: WEB; written text (blogs, news, comments).
  • Size: LG; large (560 MB).

Check the spaCy documentation for more about this model.

I chose this model, even though it is the largest, to get its full potential.

Code
# load en_core_web_lg
nlp = spacy.load("en_core_web_lg")
doc = list(nlp.pipe(df.text, n_process=4))
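Why the list() call matters can be shown without spaCy. The sketch below uses a hypothetical `fake_pipe` generator as a stand-in for nlp.pipe (it is not part of spaCy): a generator can be consumed only once, while a list can be traversed repeatedly, which is what two separate passes over doc need.

```python
def fake_pipe(texts):
    """Hypothetical stand-in for nlp.pipe: lazily yields one processed item per text."""
    for text in texts:
        yield text.lower().split()

texts = ["Subject: cheap meds", "Subject: meeting tomorrow"]

gen = fake_pipe(texts)
first_pass = [len(tokens) for tokens in gen]
second_pass = [len(tokens) for tokens in gen]  # the generator is already exhausted

docs = list(fake_pipe(texts))  # materialize once, reuse many times
list_first = [len(tokens) for tokens in docs]
list_second = [len(tokens) for tokens in docs]

print(first_pass, second_pass)
print(list_first, list_second)
```

The trade-off is memory: the list keeps every processed document alive at once, which is acceptable for a dataset of this size.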

The specific preprocessing in this case covers these steps:

  1. Remove punctuation.
  2. Remove stop words.
  3. Remove spaces.
  4. Remove “subject” token.
  5. Lemmatization.

To improve code performance, these are the most noticeable points:

  • For membership tests (x in collection), a set performs better than a list: its underlying hash table gives constant-time lookups on average instead of a linear scan. This becomes particularly noticeable as the size of df increases.
  • List comprehensions, which are usually faster than equivalent for loops and more readable in some contexts.
  • The walrus operator :=, which assigns a variable in the middle of an expression. It avoids redundant computations and can improve readability.
Code
punctuation = set(string.punctuation)
stop_words = set(stop_words)

lemmas = config.get_lemmas(doc)
entities = config.get_entities(doc)

# assignment to a df column
df["lemmas"] = lemmas
df["entities"] = entities
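The three performance points can be seen in isolation on plain strings. This is a simplified sketch, not the spaCy-based method of the Config class: the token list and the tiny stop-word set are made up for illustration, and strings stand in for spaCy tokens.

```python
import string

punctuation = set(string.punctuation)   # set: constant-time membership test on average
stop_words = {"the", "a", "is", "to"}    # tiny illustrative stop-word set

tokens = ["Subject", ":", "The", "INVOICE", "is", "attached", "!", "12345"]

# The walrus operator binds the lowercased token once and reuses it in every test.
kept = [
    text
    for t in tokens
    if (text := t.lower()) not in punctuation
    and text not in stop_words
    and text != "subject"
    and not text.isdigit()
    and len(text) >= 5
]
print(kept)
```

Without the walrus operator, t.lower() would have to be repeated in each condition or computed in a separate loop.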

5 Tasks

5.1 Classification

The text is already preprocessed as a list of lemmas. For the classification task, each list has to be converted to a string.

Code
df["feature"] = df.lemmas.apply(lambda x : " ".join(x))

5.1.1 Features

As said, the machine does not understand human-readable text, so it has to be transformed into numbers. A common approach is to vectorize it with TfidfVectorizer(), a tool that converts text into a matrix of TF-IDF features. Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical measure of how important a word is to a document within a corpus, adjusted for how frequent the word is across the whole corpus. A word's score is obtained by multiplying its Term Frequency

\[ TF = \frac{\text{word frequency in document}}{\text{total words in document}} \]

with its Inverse Document Frequency

\[ IDF = \log\left(\frac{\text{total number of documents}}{\text{documents containing the word}}\right) \]

The final result is

\[ \text{TF-IDF} = TF \times IDF \]

The resulting score represents the importance of a word. It depends on the word frequency both in a specific document and in the corpus.

::: {.callout-note collapse="true"}
An example can be useful. If a word t appears 20 times in a document of 100 words, we have

\[ TF = \frac{20}{100} = 0.2 \]

If there are 10,000 documents in the corpus and 100 of them contain the term t, then, using a base-10 logarithm,

\[ IDF = \log_{10}\left(\frac{10000}{100}\right) = 2 \]

This means the score is

\[ \text{TF-IDF} = 0.2 \times 2 = 0.4 \]
:::
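The worked example can be reproduced in a few lines. This is a sketch of the textbook formula only, with a base-10 logarithm as in the example; the function names `tf` and `idf` are made up for illustration. TfidfVectorizer() itself, with its default settings, uses a smoothed natural-logarithm IDF and L2-normalizes each row, so its values will differ from these.

```python
import math

def tf(term_count, doc_length):
    """Term frequency: share of the document taken by the term."""
    return term_count / doc_length

def idf(n_documents, n_documents_with_term):
    """Inverse document frequency with a base-10 logarithm, as in the example."""
    return math.log10(n_documents / n_documents_with_term)

tf_value = tf(20, 100)
idf_value = idf(10_000, 100)
tfidf_value = tf_value * idf_value
print(tf_value, idf_value, tfidf_value)
```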

Code
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['feature'])
y = df.label

5.2 Split

Not much to say about this: holding out a test set is a best practice that lets us evaluate the model's performance on unseen data.

Code
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=.2, random_state=42)
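What a holdout split does can be sketched in plain Python. This is an illustrative approximation, not scikit-learn's implementation: the helper name `holdout_split` is hypothetical, and details such as how the test size is rounded may differ from train_test_split.

```python
import random

def holdout_split(items, test_size=0.2, seed=42):
    """Shuffle a copy of items and split it into train and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # fixed seed makes the split reproducible
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

samples = list(range(10))
train, test = holdout_split(samples)
print(len(train), len(test))
```

The fixed seed plays the same role as random_state=42: rerunning the code yields the same partition, so results stay comparable across runs.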

5.2.1 Model

The model is MLPClassifier(), scikit-learn's Multi-Layer Perceptron classifier.

MLPClassifier

It is an Artificial Neural Network used for classification. It consists of multiple layers of nodes, called perceptrons. For further reading, see the documentation.
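To make the idea of layers of nodes concrete, here is a sketch of a single forward pass through one hidden layer with the logistic activation. The weights are hypothetical, hand-picked numbers, not values learned by MLPClassifier, and training (the part the adam solver handles) is left out.

```python
import math

def logistic(z):
    """Logistic (sigmoid) activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases):
    """One fully connected layer followed by the logistic activation."""
    return [
        logistic(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

# Hypothetical weights: 3 inputs -> 2 hidden units -> 1 output.
hidden_w = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
hidden_b = [0.0, 0.1]
output_w = [[1.2, -0.7]]
output_b = [0.05]

x = [0.0, 1.0, 0.5]                     # e.g. three TF-IDF feature values
hidden = dense(x, hidden_w, hidden_b)   # activations of the hidden layer
prob = dense(hidden, output_w, output_b)[0]
print(round(prob, 3))
```

In the model used here the hidden layer has 100 units and the input has one value per vocabulary term, but the computation has the same shape.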

Code
mlp = config.get_sklearn_mlp(
  activation="logistic",
  solver="adam",
  max_iter=100,
  hidden_layer_sizes=(100,),
  tol=.005
  )

5.2.2 Fit

Code
mlp.fit(xtrain, ytrain)
MLPClassifier(activation='logistic', max_iter=100, tol=0.005, verbose=True)

5.2.3 Predict

Code
ypred = mlp.predict(xtest)

5.2.4 Classification report

Since this is a classification task, one way to evaluate performance is the classification report. It summarizes the model's performance by comparing true and predicted labels, showing not only the metrics (precision, recall and F1-score) but also the support, i.e. the number of true samples of each class.

Code
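# classification_report orders string labels alphabetically ("ham", then "spam"),
# so target_names must follow the same order.
cr = classification_report(
  ytest,
  ypred,
  target_names=["ham", "spam"],
  digits=4,
  output_dict=True
  )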
df_cr = pd.DataFrame.from_dict(cr).reset_index()
Code
library(reticulate)

df_cr <- py$df_cr %>% dplyr::rename(names = index)
cols <- df_cr %>% colnames()
df_cr %>% 
  pivot_longer(
    cols = -names,
    names_to = "metrics",
    values_to = "values"
  ) %>% 
  pivot_wider(
    names_from = names,
    values_from = values
  ) %>% 
  gt() %>%
  tab_header(
    title = "Classification report",
    subtitle = "Sklearn MLPClassifier"
  ) %>% 
  fmt_number(
    columns = c("precision", "recall", "f1-score", "support"),
    decimals = 2,
    drop_trailing_zeros = TRUE,
    drop_trailing_dec_mark = FALSE
  ) %>% 
  cols_align(
    align = "center",
    columns = c("precision", "recall", "f1-score", "support")
  ) %>% 
  cols_align(
    align = "left",
    columns = metrics
  ) %>% 
  cols_label(
    metrics = "Metrics",
    precision = "Precision",
    recall = "Recall",
    `f1-score` = "F1-Score",
    support = "Support"
  )
Classification report
Sklearn MLPClassifier

Metrics        Precision   Recall   F1-Score   Support
ham            0.98        0.98     0.98       742
spam           0.96        0.94     0.95       293
accuracy       0.97        0.97     0.97       0.97
macro avg      0.97        0.96     0.97       1,035
weighted avg   0.97        0.97     0.97       1,035

Even though nothing was done to handle the unbalanced dataset, the imbalance does not seem to affect performance. Precision and recall are high for both classes, so high that the model could seem to be overfitted.
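To make the metrics in the report concrete, they can be computed by hand from true and predicted labels. This is a sketch on a made-up toy example, not the data used above; the helper name `class_metrics` is hypothetical.

```python
def class_metrics(y_true, y_pred, positive):
    """Precision, recall, F1 and support for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    support = sum(t == positive for t in y_true)
    return {"precision": precision, "recall": recall, "f1-score": f1, "support": support}

# Toy labels, made up for illustration.
y_true = ["ham", "ham", "ham", "spam", "spam", "ham"]
y_pred = ["ham", "ham", "spam", "spam", "ham", "ham"]

spam_metrics = class_metrics(y_true, y_pred, positive="spam")
ham_metrics = class_metrics(y_true, y_pred, positive="ham")
print(spam_metrics)
print(ham_metrics)
```

Computing the metrics per class is what makes the report informative on an unbalanced dataset: a model that always predicts ham would still reach high accuracy, but its spam recall would be zero.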

5.3 Topic Modeling for spam content

Topic modeling in NLP can rely on Latent Dirichlet Allocation (LDA), a generative model used to find the topics that occur in a set of documents.

The LDA model has:

  • Input: a corpus of text documents, preprocessed as tokenized and cleaned words. We have this in the lemmas column.
  • Output: a distribution of topics for each document and one of words for each topic.

For further reading, you can find the paper in the Table of Contents or at this link.

5.3.1 Dataset

Filter data to have a spam dataframe and create a variable with the lemmas column to work with.

Code
spam = df[df.label == "spam"]
x = spam.lemmas

5.3.2 Create corpus

The LDA algorithm needs the corpus as a bag of words.

Code
id2word = corpora.Dictionary(x)
corpus = [id2word.doc2bow(t) for t in x]
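The bag-of-words format is a list of (token id, count) pairs per document. The sketch below mimics that idea with the standard library only; it is not gensim's implementation, the names `token2id` and `doc2bow` are reused here purely for illustration, and the ids gensim assigns may differ.

```python
from collections import Counter

docs = [["price", "offer", "price"], ["offer", "click"]]

# Assign an integer id to each distinct token, in order of first appearance.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def doc2bow(doc):
    """Return the document as a sorted list of (token id, count) pairs."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

corpus = [doc2bow(doc) for doc in docs]
print(token2id)
print(corpus)
```

Word order is discarded in this representation: only which words occur and how often is kept, which is all LDA uses.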

5.3.3 Model

The model returns a user-defined number of topics. For each of them, it returns a user-defined number of words and the probability of each word.

Code
lda = config.get_lda(
  corpus=corpus,
  id2word=id2word,
  num_topics=10,
  passes=10, # number of times the algorithm sees the corpus
  workers=4, # number of worker processes (parallelization)
  random_state=42
  )
topic_words = lda.show_topics(num_topics=10, num_words=5, formatted=False)

# Flatten topic_words into (topic, word, probability) rows
data = [
    (topic, w, p)
    for topic, words in topic_words
    for w, p in words
    ]
topics_df = pd.DataFrame(data, columns=['topic', 'word', 'proba'])
Code
py$topics_df %>% 
  gt() %>% 
  tab_header(
    title = "Words and probabilities by topics"
  ) %>% 
  fmt_auto() %>% 
  cols_width(
    topic ~ pct(33),
    word ~ pct(33),
    proba ~ pct(33)
    ) %>% 
  cols_align(
    align = "center",
    columns = c(topic, word, proba)
  ) %>% 
  cols_label(
    topic = "Topic",
    word = "Word",
    proba = "Probability"
  ) %>% 
  tab_options(
    table.width = pct(100)
  )
Table 2: Topic ids, words and probabilities
Words and probabilities by topics

Topic   Word            Probability
0       account         0.009
0       money           0.007
0       email           0.006
0       please          0.006
0       claim           0.006
1       company         0.023
1       statement       0.014
1       stock           0.012
1       information     0.009
1       report          0.008
2       adobe           0.01
2       price           0.009
2       professional    0.008
2       microsoft       0.008
2       office          0.007
3       moopid          0.007
3       hotlist         0.007
3       image           0.005
3       width           0.005
3       email           0.005
4       email           0.005
4       international   0.005
4       call            0.004
4       business        0.004
4       service         0.004
5       price           0.008
5       offer           0.007
5       money           0.005
5       email           0.004
5       million         0.003
6       bingo           0.003
6       please          0.002
6       refurb          0.002
6       price           0.002
6       remove          0.002
7       height          0.017
7       width           0.013
7       computron       0.012
7       align           0.009
7       color           0.009
8       cialis          0.007
8       hour            0.004
8       brand           0.003
8       normally        0.002
8       viagra          0.002
9       online          0.007
9       prescription    0.007
9       click           0.005
9       order           0.005
9       viagra          0.005

5.4 Semantic distance between topics

Computing the semantic distance requires documents made of strings. Using a dict, we can extract the topics and their words: the keys are the topic ids and the values are lists of words. Each document is then created by joining a topic's words with .join().

Code
topics_dict = {}
for topics, words in topic_words:
  topics_dict[topics] = [w[0] for w in words]

documents = [" ".join(words) for words in topics_dict.values()]
Code

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)
cosine_sim_matrix = cosine_similarity(tfidf_matrix, dense_output=True)
topics = list(topics_dict.keys())
cosine_sim_df = pd.DataFrame(cosine_sim_matrix, index=topics, columns=topics)
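Cosine similarity itself is simple enough to compute by hand. The sketch below uses raw word counts instead of TF-IDF weights, which is a simplification of the approach above, and the helper name `cosine_similarity_counts` is hypothetical. The word lists are the last two topics and the second topic listed in Table 2.

```python
import math
from collections import Counter

def cosine_similarity_counts(words_a, words_b):
    """Cosine similarity between two word lists, using raw counts as vectors."""
    a, b = Counter(words_a), Counter(words_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

cialis_topic = ["cialis", "hour", "brand", "normally", "viagra"]
prescription_topic = ["online", "prescription", "click", "order", "viagra"]
stock_topic = ["company", "statement", "stock", "information", "report"]

sim_shared = cosine_similarity_counts(cialis_topic, prescription_topic)
sim_disjoint = cosine_similarity_counts(cialis_topic, stock_topic)
print(round(sim_shared, 3), sim_disjoint)
```

With raw counts, one shared word out of five gives a similarity of about 0.2. Table 3 reports 0.153 for the same pair because TF-IDF gives less weight to a word that appears in more than one document. Topics with no word in common score exactly 0, which explains the many zeros in the matrix.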
Code
py$cosine_sim_df %>% 
  gt() %>% 
  fmt_auto() %>% 
  cols_align(
    align = "center",
    columns = c(`0`, `1`, `2`, `3`, `4`, `5`, `6`, `7`, `8`, `9`)
  ) %>% 
  cols_label(
    `0` = "Topic 0",
    `1` = "Topic 1",
    `2` = "Topic 2",
    `3` = "Topic 3",
    `4` = "Topic 4",
    `5` = "Topic 5",
    `6` = "Topic 6",
    `7` = "Topic 7",
    `8` = "Topic 8",
    `9` = "Topic 9",
  ) %>% 
  tab_options(
    table.width = pct(100)
    )
Table 3: Cosine similarity matrix

          Topic 0  Topic 1  Topic 2  Topic 3  Topic 4  Topic 5  Topic 6  Topic 7  Topic 8  Topic 9
Topic 0   1        0        0        0.109    0.105    0.305    0.177    0        0        0
Topic 1   0        1        0        0        0        0        0        0        0        0
Topic 2   0        0        1        0        0        0.135    0.125    0        0        0
Topic 3   0.109    0        0        1        0.102    0.111    0        0.163    0        0
Topic 4   0.105    0        0        0.102    1        0.108    0        0        0        0
Topic 5   0.305    0        0.135    0.111    0.108    1        0.139    0        0        0
Topic 6   0.177    0        0.125    0        0        0.139    1        0        0        0
Topic 7   0        0        0        0.163    0        0        0        1        0        0
Topic 8   0        0        0        0        0        0        0        0        1        0.153
Topic 9   0        0        0        0        0        0        0        0        0.153    1

5.5 Organization of “HAM” mails

5.5.1 Create “HAM” df

5.5.2 Get ham lemmas which have ORG entity

Code
ham = df[df.label == "ham"]
x = ham.entities
Code
from collections import Counter

# Flatten the list of lists and create a Counter object
d = Counter([i for e in x for i in e])

freq_df = pd.DataFrame(d.items(), columns=["word", "freq"])
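The flatten-and-count step can be checked on a small made-up example. The nested list below is illustrative only, not taken from the dataset; `most_common` gives the same kind of top-n view that the table further down builds in R.

```python
from collections import Counter
from itertools import chain

# Made-up stand-in for ham.entities: one list of ORG lemmas per email.
entities = [["enron", "texas"], [], ["enron", "exxon"], ["enron"]]

# chain.from_iterable flattens the list of lists without building an intermediate list.
freq = Counter(chain.from_iterable(entities))
top = freq.most_common(2)
print(top)
```

Emails with no ORG entity contribute an empty list and simply add nothing to the counts.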
Code
library("wordcloud")
Loading required package: RColorBrewer
Code
library("RColorBrewer")

set.seed(42)
word_freqs_df <- py$freq_df

wordcloud(
  words = word_freqs_df$word,
  freq = word_freqs_df$freq,
  min.freq = 1,
  max.words = 100, # nrow(word_freqs_df),
  random.order = FALSE,
  rot.per = .3,
  colors = brewer.pal(n = 8, name = "Dark2")
  )
Figure 2: Wordcloud
Code
word_freqs_df %>% 
  arrange(desc(freq)) %>% 
  head(10) %>% 
  gt() %>%                                        
  tab_header(
    title = "Top 10 ham words ",
    subtitle = "by frequency"
  ) %>% 
  fmt_auto() %>% 
  cols_width(
    word ~ pct(50),
    freq ~ pct(50)
    ) %>% 
  cols_align(
    align = "center",
    columns = c(word, freq)
  ) %>% 
  cols_label(
    word = "Word",
    freq = "Frequency"
  ) %>% 
  tab_options(
    table.width = pct(100)
  )
Table 4: Top 10 ham words by frequency

Word      Frequency
enron     4,257
north     285
america   284
texas     196
clynes    194
energy    148
aimee     131
exxon     113
yahoo     102
lannou    101